Business Analytics

Linear Regression

Ayush Patel and Jayati Sharma

24 January, 2024

Prerequisites

You already…

  • Have knowledge of basic statistics
  • Understand univariate and multivariate linear regression
  • Understand linear regression with categorical variables

Before we begin

Please install and load the following packages

library(tidyverse)
library(MASS)
library(openintro)
library(compositions)
library(ISLR2)



Access the lecture slides from the course landing page

About Me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Linear Regression Model

  • The lm() function is used to fit linear models in R
  • Elmhurst data from openintro package
  • To understand the relation between gift aid and family income
model <- lm(gift_aid ~ family_income, data = elmhurst)
summary(model)

Call:
lm(formula = gift_aid ~ family_income, data = elmhurst)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1128  -3.6234  -0.2161   3.1587  11.5707 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24.31933    1.29145  18.831  < 2e-16 ***
family_income -0.04307    0.01081  -3.985 0.000229 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.783 on 48 degrees of freedom
Multiple R-squared:  0.2486,    Adjusted R-squared:  0.2329 
F-statistic: 15.88 on 1 and 48 DF,  p-value: 0.0002289

Strength of a Fit

Content for this topic has been sourced from the book ‘Introduction to Modern Statistics’. Please check out the book for detailed information.

  • R-squared \(R^2\): describes the amount of variation in the outcome variable that is explained by the least squares line

Variance of the outcome variable

var(elmhurst$gift_aid)
[1] 29.81795

Variance of the residuals (the squared residual standard error)

summary(model)$sigma^2
[1] 22.87325

Strength of a Fit

  • Applying the least squares line reduces our uncertainty in predicting aid from a student’s family income

  • \((s_{outcome}^2 - s_{residual}^2)/s_{outcome}^2\)

  • \((29.82 - 22.87)/29.82 \approx 0.23\), a reduction of about 23%
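
As a quick check of the variance-reduction formula above, the sketch below fits a model on simulated data (hypothetical values, not the elmhurst example) and confirms that the reduction equals the \(R^2\) reported by summary() exactly when the sample variance of the residuals is used:

```r
# Hypothetical simulated data, not the elmhurst dataset
set.seed(42)
family_income <- runif(50, 0, 250)
gift_aid <- 24 - 0.04 * family_income + rnorm(50, sd = 5)
fit <- lm(gift_aid ~ family_income)

# Share of the outcome's variance explained by the fit
reduction <- (var(gift_aid) - var(residuals(fit))) / var(gift_aid)

# Identical to the R-squared reported by summary()
all.equal(reduction, summary(fit)$r.squared)
```

The identity is exact because, with an intercept in the model, the residuals have mean zero, so var() divides both sums of squares by the same n − 1.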

Strength of a Fit

  • Using information about family income in a linear model reduced the variation of the outcome variable by about 23%, close to the model’s \(R^2\) of 0.249

  • Correlation between the two variables

cor(elmhurst$family_income, elmhurst$gift_aid)
[1] -0.4985561
  • R-squared corresponds exactly to the squared value of the correlation
  • \(r = -0.499 \rightarrow R^2 = r^2 \approx 0.25\)
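
This correspondence can be verified directly on simulated data (hypothetical variables); note it holds only in simple, one-predictor regression:

```r
set.seed(1)
x <- rnorm(100)
y <- 3 - 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

# Squared correlation equals the model's R-squared (one predictor only)
all.equal(cor(x, y)^2, summary(fit)$r.squared)
```

In multiple regression, \(R^2\) instead equals the squared correlation between the response and the fitted values.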

Do it Yourself - 1

  • Use the Auto data from ISLR2
  • What is the regression equation for modelling how horsepower depends on weight?
  • Fit a model for the above equation and find coefficients and \(R^2\)
  • What does the value of \(R^2\) mean?

Adjusted R-squared

  • In multivariate regression, the equation for the line of best fit has some changes
  • The adjusted R-squared is calculated as: \(R^2_{adj} = 1 - (1 - R^2)\dfrac{n-1}{n-k-1}\)
  • where n = number of observations used to fit the model
  • k = number of predictor variables in the model
  • Note : a categorical predictor with p levels will contribute p − 1 to the number of variables in the model
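
The formula above can be checked against R's own computation; a minimal sketch on simulated data (all variable names hypothetical):

```r
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)                       # pure noise predictor
y <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

k <- 2                               # number of predictors
r2 <- summary(fit)$r.squared
adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Matches the adjusted R-squared reported by summary()
all.equal(adj, summary(fit)$adj.r.squared)
```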

Adjusted R-squared - Why?

  • The reasoning behind the adjusted R-squared lies in the degrees of freedom associated with each variance
  • Degrees of freedom - the number of independent values that are free to vary in a dataset
  • In multiple regression, the residual degrees of freedom equal n - k − 1
  • Plain \(R^2\) never decreases when a predictor is added, even an irrelevant one, so it is biased towards larger models; the adjusted \(R^2\) formula corrects this bias

Model Selection

  • Two common strategies for adding or removing variables in a multiple regression
  • Backward elimination starts with the full model
  • Variables are eliminated one-at-a-time from the model until we cannot improve the model any further
  • Forward selection is the reverse of the backward elimination technique
  • We add variables one-at-a-time until we cannot find any variables that improve the model any further
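
A minimal backward-elimination sketch, driven by adjusted \(R^2\) on simulated data (all names hypothetical): at each step it drops the predictor whose removal most improves \(R^2_{adj}\), stopping when no removal helps.

```r
set.seed(3)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)   # x3 is irrelevant to y

# Adjusted R-squared of a model using the given predictors
adj_r2 <- function(vars) {
  summary(lm(reformulate(vars, response = "y"), data = dat))$adj.r.squared
}

preds <- c("x1", "x2", "x3")
repeat {
  if (length(preds) == 1) break
  current <- adj_r2(preds)
  # Adjusted R-squared after dropping each predictor in turn
  drops <- sapply(preds, function(v) adj_r2(setdiff(preds, v)))
  if (max(drops) <= current) break              # no removal improves the fit
  preds <- setdiff(preds, names(which.max(drops)))
}
preds   # predictors retained by backward elimination
```

The true predictors x1 and x2 survive elimination; a pure-noise predictor like x3 is typically dropped because removing it raises \(R^2_{adj}\).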

Model Selection

  • Criterion for either selection method: adjusted \(R^2\)
  • Eliminate or add variables depending on whether they lead to the largest improvement in \(R^2_{adj}\)
loans <- openintro::loans_full_schema %>%
  mutate(credit_util = total_credit_utilized/total_credit_limit)

loan_model <- lm(interest_rate ~ verified_income + debt_to_income + public_record_bankrupt + term + credit_util + issue_month, data = loans)
summary(loan_model)

Call:
lm(formula = interest_rate ~ verified_income + debt_to_income + 
    public_record_bankrupt + term + credit_util + issue_month, 
    data = loans)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.0116  -3.1376  -0.7338   2.3464  19.4852 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     2.234302   0.210123  10.633  < 2e-16 ***
verified_incomeSource Verified  1.099804   0.099626  11.039  < 2e-16 ***
verified_incomeVerified         2.667962   0.117801  22.648  < 2e-16 ***
debt_to_income                  0.022763   0.002959   7.692 1.58e-14 ***
public_record_bankrupt          0.489424   0.128773   3.801 0.000145 ***
term                            0.154173   0.003975  38.789  < 2e-16 ***
credit_util                     4.838323   0.163103  29.664  < 2e-16 ***
issue_monthJan-2018             0.048263   0.108881   0.443 0.657586    
issue_monthMar-2018            -0.047001   0.107379  -0.438 0.661606    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.334 on 9965 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.2486,    Adjusted R-squared:  0.248 
F-statistic: 412.2 on 8 and 9965 DF,  p-value: < 2.2e-16

Do it Yourself - 2

  • Use Credit data from ISLR2
  • Use balance as response and create a model that you think is good. Try out qualitative variables as well.
  • Observe how the value of \(R^2\) changes when you add/remove predictors. What do you infer from this?

Regression Diagnostics

  • Non-linearity of the response-predictor relationships
  • Correlation of error terms
  • Non-constant variance of error terms
  • Outliers
  • High-leverage points
  • Collinearity

Regression Diagnostics

  • Linearity - The data should show a linear trend
    • If there is an indication of a non-linear relation, try non-linear transformations of your predictors
  • Correlated error terms - We assume that \(\epsilon_1, \epsilon_2, \ldots, \epsilon_n\) are uncorrelated. This means that no reasonable deduction about \(\epsilon_{n+1}\) can be made from the information we have about \(\epsilon_n\).
    • What if they are correlated?
    • Recall that the standard errors are calculated under this assumption.
    • If the error terms are correlated, the estimated standard errors understate the true uncertainty, so we may end up trusting the model more than we should.
    • This is often seen in time-series data
    • How to test - Durbin-Watson test, Ljung-Box Q test
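
The Durbin-Watson statistic can be computed by hand in base R (packages such as lmtest provide the formal test). The sketch below, on simulated data with hypothetical names, contrasts independent errors with AR(1)-correlated errors: values near 2 suggest no first-order autocorrelation, while values well below 2 indicate positive autocorrelation.

```r
set.seed(11)
n <- 200
x <- rnorm(n)

# Durbin-Watson statistic of a fitted model's residuals
dw <- function(y) {
  e <- residuals(lm(y ~ x))
  sum(diff(e)^2) / sum(e^2)
}

dw_ind <- dw(1 + x + rnorm(n))                               # near 2
dw_ar  <- dw(1 + x + as.numeric(arima.sim(list(ar = 0.8), n)))  # well below 2
c(dw_ind, dw_ar)
```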

Regression Diagnostics

  • Nearly normal residuals - the residuals should be approximately normally distributed
    • When residuals are not normal, it is usually because of outliers or influential points
  • Constant or equal variability - The variability of points around the least squares line should remain roughly constant
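
Base R's plot() method on a fitted model produces the standard diagnostic plots. The sketch below simulates heteroscedastic data (hypothetical variables, with error spread growing with the predictor) and adds a crude numeric check: the correlation between the absolute residuals and the fitted values.

```r
set.seed(5)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)   # error spread grows with x
fit <- lm(y ~ x)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# Crude check: a positive correlation suggests non-constant variance
fan <- cor(abs(residuals(fit)), fitted(fit))
fan
```

A visible "fan" shape in the residuals-vs-fitted panel tells the same story as the positive correlation.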

Regression Diagnostics

  • Collinearity - When collinearity exists between two predictors, it is difficult to say how each predictor is individually associated with the response.

    • Look at the correlation matrix of all predictors. (Not a catch-all solution: it cannot detect multicollinearity among three or more variables)
    • The variance inflation factor (VIF) is the ratio of the variance of \(\hat\beta_j\) when fitting the full model to the variance of \(\hat\beta_j\) if fit on its own.
    • The VIF for an individual predictor can be computed as:

    \[VIF(\hat\beta_j) = \frac{1}{1-R_{X_j|X_{-j}}^2}\]

    • Two ways to deal with this: drop a predictor with a high VIF, or combine the collinear predictors into a single one.
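
The VIF formula can be applied by hand in base R (car::vif automates this). A sketch on simulated data where x1 and x2 are strongly correlated and x3 is independent (all names hypothetical); note the response is not needed, since VIF depends only on the predictors:

```r
set.seed(9)
n <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # strongly correlated with x1
x3 <- rnorm(n)

# VIF: regress each predictor on the others, then take 1 / (1 - R^2)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)
vif_x3 <- 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
c(vif_x1, vif_x3)   # x1 inflated by its overlap with x2; x3 near 1
```

A common rule of thumb treats VIF above 5 (or 10) as a sign of problematic collinearity.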

Thank You :)